An Operating-System-Level Framework for Providing Application-Aware Reliability
Authors
Abstract
Operating systems make it possible to collect and extract rich information about application execution characteristics, including program counter traces, memory access patterns, and operating-system-generated signals. This information can be exploited to design highly efficient, application-aware reliability mechanisms that are transparent to applications. This paper describes the Reliability MicroKernel (RMK) framework, a loadable kernel module that provides application-aware reliability and supports dynamic configuration of the reliability mechanisms installed in it. The RMK prototype is implemented in Linux and supports detection of application and OS failures as well as transparent application checkpointing. Experimental results show that OS hang detection and application hang detection, which exploit characteristics of application and system behavior, can achieve 100% coverage with low false positive rates. Moreover, the performance overhead of RMK and of the detection and checkpointing mechanisms is small (0.6% for application hang detection and 0.1% for transparent application checkpointing in the experiments).
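The abstract does not say how RMK decides that an application has hung, so the following is only an illustrative sketch of one generic approach: flag a hang when an observed progress counter stops changing for longer than a timeout. The class name `HangDetector`, the `timeout` parameter, and the idea of sampling a single counter are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HangDetector:
    """Hypothetical progress-based hang detector (not RMK's actual design)."""

    timeout: float                    # assumed: seconds without progress before flagging
    last_count: Optional[int] = None  # most recent distinct counter value seen
    last_change: float = 0.0          # time at which the counter last changed

    def sample(self, count: int, now: float) -> bool:
        """Record one observation; return True if a hang is suspected."""
        if self.last_count is None or count != self.last_count:
            # The counter moved (or this is the first sample): progress was made.
            self.last_count = count
            self.last_change = now
            return False
        # No progress since last_change; suspect a hang once the timeout elapses.
        return (now - self.last_change) >= self.timeout


detector = HangDetector(timeout=5.0)
for t, c in [(0.0, 10), (3.0, 11), (7.0, 11), (8.0, 11)]:
    print(t, detector.sample(c, t))
```

Passing the timestamp in explicitly, rather than reading a clock inside `sample`, keeps the sketch deterministic and easy to test. A real detector would also need a policy for choosing the timeout, since that choice drives the false positive rate the abstract mentions.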
Similar resources
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
A framework for reliability-aware design exploration on MPSoC based systems
Applying system-level fault-tolerant techniques such as active redundancy is a promising way to enhance the system reliability for safety-related applications. Embedded system design using active redundancy is a challenging task that involves solving two major problems, namely finding the optimal redundancy configuration and mapping/scheduling of the application (including the redundant compone...
Application-Aware Reliability and Security: The Trusted ILLIAC Approach
Security and reliability are the key attributes in building highly trusted systems. System security violations (e.g., unauthorized privileged access or the compromising of data integrity) and reliability failures can be caused by hardware problems (transient or intermittent), software bugs, resource exhaustion, environmental conditions, or any complex interaction among these factors. To build a...
Testing Reliable Distributed Applications Through Simulated Events
There are many distributed applications that incorporate application-specific reliability algorithms which operate on top of general purpose networking, operating system, and programming language facilities. We present a framework for application-level reliability testing suitable for a wide range of distributed applications, and describe how we've applied it to one particular application, Mercu...
Towards an Integrated Framework for Reliability-Aware Embedded System Design on Multiprocessor System-on-Chips
Today’s integrated circuits are becoming more susceptible to faults due to effects caused by aggressive technology scaling. However, to meet the ever-increasing reliability requirements of safety-related applications, the system has to function correctly even in the presence of faults. This leads to the challenging problem of fault-tolerant system design. Here, the goal is to meet the required ...
Journal title:
Volume, issue:
Pages: -
Publication date: 2006